Abstract

Linguistic structures exhibit a rich array of global phenomena;
however, commonly used Markov models are unable to adequately
describe these phenomena due to their strong locality assumptions.
We propose a novel hierarchical model for structured prediction
over sequences and trees which exploits global context by
conditioning each generation decision on an unbounded context of
prior decisions. This builds on the success of Markov models but
without imposing a fixed bound, in order to better represent global
phenomena. To facilitate learning of this large and unbounded
model, we use a hierarchical Pitman-Yor process prior which
provides a recursive form of smoothing. We propose prediction
algorithms based on A* and Markov Chain Monte Carlo sampling.
Empirical results demonstrate the potential of our model compared
to baseline finite-context Markov models on part-of-speech tagging
and syntactic parsing.


1 Introduction 

Markov models are widely used techniques for modelling the
underlying structure of natural language, e.g., as sequences and
trees. However, local Markov assumptions often fail to capture
phenomena outside the local Markov context, i.e., when the data
generation process exhibits long range dependencies. A prime
example is language modelling, where only short range dependencies
are captured by finite-order (i.e. n-gram) Markov models. However,
it has been shown that going beyond finite order in a Markov model
improves language modelling because natural language embodies a
large array of long range dependencies (Wood et al., 2009a). While
infinite order Markov models have been extensively explored for
language modelling (Gasthaus and Teh, 2010; Wood et al., 2011),
this has not yet been done for structure prediction.

In this paper, we propose an infinite-order Markov 
model for predicting latent structures, namely tag 
sequences and trees. We show that this expressive 
model can be applied to various structure prediction 
tasks in NLP, such as syntactic parsing and part-of- 
speech tagging. We propose effective algorithms to 
tackle significant learning and inference challenges 
posed by the infinite Markov model. 

More specifically, we propose an unbounded-depth, hierarchical,
Bayesian non-parametric model for the generation of linguistic
utterances and their corresponding structure (e.g., the sequence
of POS tags or syntax trees). Our model conditions each decision
in a tree generating process on an unbounded context consisting of
the vertical chain of their ancestors, in the same way that
infinite sequence models (e.g., ∞-gram language models) condition
on an unbounded window of linear context (Mochihashi and Sumita,
2007; Wood et al., 2009b).

Learning in this model is particularly challenging due to the
large space of contexts and corresponding data sparsity. For this
reason predictive distributions associated with contexts are
smoothed using distributions for successively smaller contexts via
a hierarchical Pitman-Yor process, organised as a trie. The
infinite context makes it impossible to directly apply dynamic
programming for structure prediction. We present two inference
algorithms based on A* and Markov Chain Monte Carlo (MCMC) for
predicting the best structure for a given input utterance.

The experiments show that our generative model obtains similar
performance to the state-of-the-art Stanford part-of-speech tagger
(Toutanova and Manning, 2000) for English and Swedish. For Danish,
our model outperforms the Stanford tagger, which is impressive
given that the Stanford tagger uses many more complex features and
a discriminative training objective. Our experiments on parsing
show that our unbounded-context tree model adapts itself to the
data to effectively capture sufficient context to outperform both
a PCFG baseline and Markov models with finite ancestor
conditioning.


2 Background and related work 

The syntactic parse tree of an utterance can be generated by
combining a set of rules from a grammar, such as a context free
grammar (CFG). A CFG is a 4-tuple $\mathcal{G} = (T, N, S, \mathcal{R})$,
where $T$ is a set of terminal symbols, $N$ is a set of
non-terminal symbols, $S \in N$ is the distinguished root
non-terminal and $\mathcal{R}$ is a set of productions (a.k.a.
rewriting rules). A PCFG assigns a probability to each rule in the
grammar, where $\sum_{B, C} P(A \to B\, C \mid A) = 1$. The grammar
rules are often in Chomsky Normal Form, taking either the form
$A \to B\, C$ or $A \to a$, where $A$, $B$, $C$ are syntactic
categories (non-terminals), and $a$ is a word (terminal).
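The normalisation constraint above can be illustrated with a minimal sketch. It assumes rules are given as (left-hand side, right-hand side) pairs and estimates probabilities by relative frequency; the function name `estimate_pcfg` and the toy rules are illustrative and not taken from the paper.

```python
from collections import Counter, defaultdict

def estimate_pcfg(rules):
    """Relative-frequency estimate of P(rhs | lhs) from observed (lhs, rhs) pairs."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    probs = defaultdict(dict)
    for (lhs, rhs), n in rule_counts.items():
        probs[lhs][rhs] = n / lhs_counts[lhs]
    return probs

# Toy CNF rules (hypothetical data): binary rules rewrite to two
# non-terminals, lexical rules rewrite to a single word.
observed = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("that",)),
    ("VP", ("VB", "NP")),
]
pcfg = estimate_pcfg(observed)
```

By construction the probabilities of all rules sharing a left-hand side sum to one, which is the constraint stated above.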

Tag sequences can also be represented as a tree structure, without
loss of generality, in which rules take the form $A \to B\, a$ or
$A \to a$, where $A$, $B$ are POS tags, and $a$ is a word. Hence
tagging models can be represented by restricted (P)CFGs. This
unified view of syntactic parsing and POS tagging will allow us to
apply our model and inference algorithms to these problems with
only minor refinements (see Figure 1).
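As a rough sketch of this restricted grammar, the snippet below turns a tagged sentence into rules of the two forms. It reads the rule form literally, so that B is assumed to be the tag of the preceding word; the opposite orientation would only change which neighbour is used. The function name `tagging_rules` is hypothetical.

```python
def tagging_rules(words, tags):
    """Rules A -> B a / A -> a for one tagged sentence.

    Assumption: in A -> B a, the tag B belongs to the word that
    precedes a, so the first word uses the lexical form A -> a.
    """
    rules = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        if i == 0:
            rules.append((tag, (word,)))               # A -> a
        else:
            rules.append((tag, (tags[i - 1], word)))   # A -> B a
    return rules

example_rules = tagging_rules(["that's", "fine", "now"],
                              ["PRON", "VERB", "ADV"])
```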

In a PCFG, a tree is generated by starting with the root symbol
and rewriting (substituting) it with a grammar rule, then
continuing to rewrite frontier non-terminals with grammar rules
until there are no remaining frontier non-terminals. When making
the decision about the next rule to expand a frontier
non-terminal, the only conditioning context used from the
partially generated tree is the frontier non-terminal itself,
i.e., the rewrite rule is assumed independent from the remainder
of the tree given the frontier non-terminal. Our model relaxes
this strong independence assumption by considering unbounded
vertical history when making the next inference decision. This
takes into account a wider context when making the next parsing
decision.

Perhaps the most relevant work is on unbounded history language
models (Mochihashi and Sumita, 2007; Wood et al., 2009a). A prime
example is the Sequence Memoizer (Wood et al., 2011), which
conditions the generation of the next word on an unbounded history
of previously generated words. We build on these techniques to
develop rich infinite-context models for structured prediction,
leading to additional complexity and challenges.

For syntactic parsing, several infinite extensions of
probabilistic context free grammars (PCFGs) have been proposed
(Liang et al., 2007; Finkel et al., 2007). These approaches
achieve infinite grammars by allowing an unbounded set of
non-terminals (hence grammar rules), but still make use of a
bounded history when expanding each non-terminal. An alternative
method allows for infinite grammars by considering segmentation of
trees into arbitrarily large tree fragments, although only a
limited history is used to conjoin fragments (Cohn et al., 2010;
Johnson et al., 2006). Our work achieves infinite grammars by
growing the vertical history needed to make the next parsing
decision, as opposed to growing the number of rules, non-terminals
or states horizontally, as done in prior work.

Earlier work in syntactic parsing has also looked into growing
both the history vertically and the rules horizontally, in a
bounded setting. Johnson (1998) increased the history for the
parsing task by parent-annotation, i.e., annotating each
non-terminal in the training parse trees by its parent, and then
reading off the grammar rules from the resulting trees. Klein and
Manning (2003) considered vertical and horizontal markovization
while using the head words' part-of-speech tag, and showed that
increasing the size of the vertical contexts consistently improves
the parsing performance. Petrov et al. (2006), Petrov and Klein
(2007) and Matsuzaki et al. (2005) treated non-terminal
annotations as latent variables and estimated them from the data.

Likewise, finite-state hidden Markov models (HMMs) have been
extended horizontally to have a countably infinite number of
states (Beal et al., 2001). Previous works on applying Markov
models to part-of-speech tagging considered either finite-order
Markov models (Brants, 2000) or finite-order HMMs (Thede and
Harper, 1999). We differ from these works by conditioning both the
emissions and transitions on their full contexts.

Figure 1: Examples of infinite-order conditioning and smoothing
mechanism. The bold symbols (NN, ADV, fine) are the part of the
structure being generated, and the boxes correspond to the
conditioning context. (a) Syntactic parsing, and (b) infinite-order
HMM for POS tagging.


3 The Model 


Our model relaxes strong local Markov assumptions in PCFG to
enable capturing phenomena outside of the local Markov context.
The model conditions the generation of a rule in a tree on its
unbounded vertical history, i.e., its ancestors on the path
towards the root of the tree (see Figure 1). Thus the probability
of a tree $T$ is

$$P(T) = \prod_{(u, r) \in T} G_{[u]}(r)$$

where $r$ denotes the rule and $u$ its history, and
$G_{[u]}(\cdot)$ is the probability of the next inference decision
(i.e., grammar rule) conditioned on the context $u$. In other
words, a tree $T$ can be represented as a sequence of context-rule
events $\{(u, r) \in T\}$.
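To make the event view concrete, here is a minimal sketch that reads the context-rule events off a tree and scores it. It assumes trees are nested tuples `(label, child, ...)` with words as plain strings, and that the context u is the chain of labels from the root down to the node being rewritten; both the representation and the function names are assumptions made for illustration.

```python
import math

def context_rule_events(tree, ancestors=()):
    """Yield (u, r) pairs: u is the chain of labels from the root down to
    the node being rewritten, r is the rule applied at that node."""
    label, children = tree[0], tree[1:]
    u = ancestors + (label,)
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield u, (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from context_rule_events(child, u)

def tree_log_prob(tree, rule_log_prob):
    """log P(T) as the sum over (u, r) in T of log G_[u](r).

    rule_log_prob(u, r) is any callable returning log G_[u](r)."""
    return sum(rule_log_prob(u, r) for u, r in context_rule_events(tree))

toy_tree = ("S", ("NP", ("NN", "time")), ("VP", ("VB", "flies")))
toy_events = list(context_rule_events(toy_tree))
```

Truncating each `u` to its last few labels would give the bounded vertical contexts used by finite-order baselines.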

When learning such a model from data, a vector of predictive
probabilities for the next rule $G_{[u]}(\cdot)$ given each
possible vertical context $u \in \mathcal{U}$ must be learned,
where depending on the problem $\mathcal{U}$ can denote the set of
chains of non-terminals $N^*$ or chains of rules $\mathcal{R}^*$.
As the context size increases, the number of events observed for
such long contexts in the training data drastically decreases,
which makes parameter estimation challenging, particularly when
generalising to unseen contexts. Assuming our unbounded-depth
model, we need suitable smoothing techniques to estimate
conditional rule probabilities for large (and possibly infinite
depth) contexts. We achieve smoothing by placing a hierarchical
Bayesian prior over the set of probability distributions
$\{G_{[u]}\}_{u \in \mathcal{U}}$. We smooth $G_{[u]}$ with a
distribution conditioned on a shorter context $G_{[\pi(u)]}$,
where $\pi(u)$ is the suffix of $u$ containing all but the
earliest event. This ties parameters of longer histories to their
shorter suffixes in a hierarchical manner, and leads to sharing
statistical strengths to overcome sparsity issues. Figure 1 shows
our infinite-order Markov model and the smoothing mechanism
described here.

Figure 2: Part of the smoothing mechanism corresponding to Figure
1(a). Each node represents a distribution G labeled with a
context, and the directed edges demonstrate the direction of
smoothing. The path in bold corresponds to the smoothing for the
rule NP → NN.

More specifically, we assume that a distribution with the full
history $G_{[u]}$ is related to a distribution with the most
recent history $G_{[\pi(u)]}$ through the Pitman-Yor process PYP
(Wood et al., 2011):

$$G_{[\epsilon]} \mid d_0, c_0, H \sim \mathrm{PYP}(d_0, c_0, H)$$

$$G_{[u]} \mid d_{|u|}, c_{|u|}, G_{[\pi(u)]} \sim \mathrm{PYP}(d_{|u|}, c_{|u|}, G_{[\pi(u)]})$$

where $H$ denotes the base (e.g. uniform) distribution, and
$\epsilon$ denotes the empty context. The Pitman-Yor process
$\mathrm{PYP}(d, c, H)$ is a distribution over distributions,
where $d$ is the discount parameter, $c$ is the concentration
parameter, and $H$ is the base distribution. Note that $G_{[u]}$
depends on $G_{[\pi(u)]}$, which itself depends on
$G_{[\pi(\pi(u))]}$, etc. This leads to a hierarchical Pitman-Yor
process prior where context-dependent distributions are hidden.
The formulation of the hierarchical PYP over different length
contexts is illustrated in Figure 2.

Figure 3 demonstrates the property of the PYP and how its
behaviour depends on the discount $d$ and concentration $c$
parameters. Note that the PYP allows a good fit to the data
distribution compared to the Dirichlet Process ($d = 0$; as used
in prior work), which cannot adequately represent the long tail of
events.

Figure 3: Log-log plot of rule frequency vs rank, illustrated for
(a) syntactic parsing (context u: S NP) and (b) POS tagging
(context u: VERB). Besides the data distribution, we also show
samples from three PYP distributions with different hyperparameter
values, c, d.


4 Learning

Given a training tree-bank, i.e., a collection of utterances and
their trees, we are interested in the posterior distribution over
$\{G_{[u]}\}_{u \in \mathcal{U}}$. We make use of the approach
developed in Wood et al. (2011) for learning such suffix-based
graphical models when learning infinite-depth language models. It
makes use of the Chinese Restaurant Process (CRP) representation
of the Pitman-Yor process in order to marginalize out the
distributions $G_{[u]}$ (Teh, 2006) and learn the predictive
probabilities $P(r \mid u)$.

Under the CRP representation each context corresponds to a
restaurant. As a new $(u, r)$ is observed in the training data, a
customer enters the restaurant, i.e., the trie node corresponding
to $u$. Whenever a customer enters a restaurant, it should be
decided whether to seat him at an existing table serving the dish
$r$, or to seat him at a new table and send a proxy customer to
the parent node in the trie to order $r$ (i.e., based on
$(\pi(u), r)$). Fixing a seating arrangement $S$ and PYP
parameters $\theta$ for all restaurants (i.e., the collection of
concentration and discount parameters), the predictive probability
of a rule based on our infinite-context rule model is:

$$P(r \mid \epsilon, S, \theta) = H(r)$$

$$P(r \mid u, S, \theta) = \frac{n^u_r - d_{|u|} t^u_r}{n^u + c_{|u|}} + \frac{c_{|u|} + d_{|u|} t^u}{n^u + c_{|u|}} P(r \mid \pi(u), S, \theta)$$

where $d_{|u|}$ and $c_{|u|}$ are the discount and concentration
parameters, $n^u_{rk}$ is the number of customers at table $k$
served the dish $r$ in the restaurant $u$ (accordingly $n^u_r$ is
the number of customers served the dish $r$ and $n^u$ is the
number of customers), and $t^u_r$ is the number of tables serving
dish $r$ in the restaurant $u$ (accordingly $t^u$ is the number of
tables).
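The recursion for the predictive probability can be sketched as follows. This is a minimal illustration, assuming customer and table counts are stored in nested dictionaries keyed by context tuples, and that π(u) drops the first (earliest) element of the tuple; the function signature and the toy counts are hypothetical.

```python
def predictive_prob(rule, context, counts, tables, discount, concentration, base):
    """P(r | u, S, theta) for a fixed seating arrangement S.

    counts[u][r] holds n^u_r and tables[u][r] holds t^u_r.
    discount(m) and concentration(m) return d_m and c_m for context length m.
    base(r) returns H(r).
    """
    if len(context) == 0:
        return base(rule)                    # P(r | eps, S, theta) = H(r)
    parent = predictive_prob(rule, context[1:], counts, tables,
                             discount, concentration, base)
    n_u = sum(counts.get(context, {}).values())
    if n_u == 0:
        return parent   # with no customers the formula reduces to the parent term
    t_u = sum(tables[context].values())
    d, c = discount(len(context)), concentration(len(context))
    n_ur = counts[context].get(rule, 0)
    t_ur = tables[context].get(rule, 0)
    return (n_ur - d * t_ur) / (n_u + c) + (c + d * t_u) / (n_u + c) * parent

# Toy restaurants (hypothetical numbers), one table per served dish.
toy_rules = ["NP->NN", "NP->DT NN", "NP->NP PP", "NP->PRP"]
toy_counts = {("S", "NP"): {"NP->NN": 3, "NP->DT NN": 1},
              ("NP",): {"NP->NN": 1, "NP->DT NN": 1}}
toy_tables = {("S", "NP"): {"NP->NN": 1, "NP->DT NN": 1},
              ("NP",): {"NP->NN": 1, "NP->DT NN": 1}}

def toy_prob(rule, context):
    return predictive_prob(rule, context, toy_counts, toy_tables,
                           lambda m: 0.5, lambda m: 1.0, lambda r: 0.25)
```

Because the discounted mass removed from observed rules is exactly the mass passed to the shorter context, the predictive distribution stays normalised whenever the base distribution is.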

The seating arrangements (the state of all restaurants including
their tables and the customers sitting at each table) are hidden,
so they need to be marginalized out:

$$P(r \mid u, \mathcal{D}) = \int P(r \mid u, S, \theta)\, P(S, \theta \mid \mathcal{D})\, d(S, \theta)$$

where $\mathcal{D}$ is the training tree-bank. We approximate this
integral by the so-called "minimal assumption seating arrangement"
and the MAP parameter setting $\theta$ which maximizes the
corresponding data posterior. Based on the minimal assumption, a
new table is created only when there is no table serving the
desired dish in a restaurant $u$. That is, a proxy customer is
created and sent to the parent node in the trie $\pi(u)$ for each
unique dish type (sequence of events).

This approximation has been shown to recover interpolated
Kneser-Ney smoothing when applied to hierarchical Pitman-Yor
process language models (Teh, 2006).
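The minimal assumption seating arrangement can be sketched in a few lines. The snippet assumes contexts are tuples whose first element is the earliest event, so that π(u) is `context[1:]`, and it keeps exactly one table per dish in each restaurant; the function name and toy events are illustrative only.

```python
from collections import defaultdict

def minimal_seating(events):
    """Customer counts n^u_r and table counts t^u_r when every restaurant
    keeps exactly one table per dish it serves."""
    counts = defaultdict(lambda: defaultdict(int))
    tables = defaultdict(lambda: defaultdict(int))

    def add_customer(context, rule):
        counts[context][rule] += 1
        if tables[context][rule] == 0:
            tables[context][rule] = 1      # first customer opens the only table
            if context:                    # proxy customer is sent to pi(u)
                add_customer(context[1:], rule)

    for context, rule in events:
        add_customer(context, rule)
    return counts, tables

toy_seating_events = [(("S", "NP"), "NP->NN")] * 3 + [(("VP", "NP"), "NP->NN")]
seat_counts, seat_tables = minimal_seating(toy_seating_events)
```

In this toy run the restaurant for the context `("NP",)` receives two customers, one proxy from each distinct longer context, which mirrors the type counts used by Kneser-Ney smoothing.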


The parameter $\theta$ is learned by maximising the posterior,
given the seating arrangement corresponding to the minimal
assumption. We put the following prior distributions over the
parameters: $d_m \sim \mathrm{Beta}(a_m, b_m)$ and
$c_m \sim \mathrm{Gamma}(\alpha_m, \beta_m)$. The posterior is the
prior multiplied by the following likelihood term:


n h n 


Md-iu, 

Mi- 


nnn^Hif 


i) 


where $[a]_b^c$ denotes the generalised factorial function.¹ We
maximize the posterior with the constraints $c_m > 0$ and
$d_m \in [0, 1)$ using the L-BFGS-B optimisation method (Zhu et
al., 1997), which results in the optimised discount and
concentration values for each context size.


5 Prediction 


In this section, we propose algorithms for the challenging problem
of predicting the highest scoring tree. The key ideas are to
compactly represent the space of all possible trees for a given
utterance, and then search for the best tree in this space in a
top-down manner. By traversing the hyper-graph top-down, the
search algorithms have access to the full history of grammar
rules.

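At test time, we need to predict the tree structure of a given
utterance $w$ by maximizing the tree score under the model learned
from the training tree-bank $\mathcal{D}$:

$$\arg\max_T P(T \mid \mathcal{D}, w) = \arg\max_T \prod_{(u, r) \in T} P(r \mid u, \mathcal{D})$$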

The unbounded context allowed by our model makes it infeasible to
apply dynamic programming, e.g. CYK (Cocke and Schwartz, 1970),
for finding the highest scoring tree. CYK is a bottom-up algorithm
which requires storing in a dynamic programming table the score of
each utterance's sub-span conditioned on all possible contexts.
Even truncating the


1 M° = Mb 1 = 1 and Mo = rii=o ( a + ib )• 



Figure 4: Hyper-graph representation of the search 
space. The gray areas are examples of two partial 
hypotheses in A* priority queue. 


context size to bound this term may be insufficient to 
allow CYK for prediction, due to the unreasonable 
computational complexity. 

The space of all possible trees for a given utterance can be
compactly represented as a hyper-graph (Klein and Manning, 2001).
Each hyper-graph node is labelled with a non-terminal and a
sub-span of the utterance. There exists a hyper-edge from the
nodes $B[i, j]$ and $C[j + 1, k]$ to the node $A[i, k]$ if the
rule $A \to B\, C$ belongs to the grammar (Figure 4). Starting
from the top node $S[0, N]$, our prediction algorithms search for
the highest scoring tree sub-graph that covers all of the
utterance terminals in the hyper-graph. Our top-down prediction
algorithms have access to the full history needed by our model
when deciding about the next hyper-edge to be added to the partial
tree.


5.1 A* Search 

This algorithm incrementally expands frontier nodes of the best
partial tree until a complete tree is constructed. In the
expansion step, all possible rules for expanding all frontier
non-terminals are considered and the resulting partial trees are
inserted into a priority queue (see Figure 4), sorted based on the
following score:

$$\mathrm{Score}(T^+) = \log P(T) + \log G_{[u]}(A \to B\, C) + h(T^+, A \to B\, C, i, k, j \mid G')$$

where $T^+$ is a partial tree after expanding a frontier
non-terminal, $P(T)$ is the probability of the current partial
tree, $G_{[u]}(A \to B\, C)$ is the probability of expanding a
non-terminal via a rule $A \to B\, C$ in the full context $u$, and
$h$ is the heuristic function (i.e., the estimate of the score for
the best tree completing $T^+$). We use various heuristic
functions when expanding a node $A[i, j]$ in the hypergraph via a
hyperedge with tails $B[i, k]$ and $C[k + 1, j]$:

• Full Frontier: which estimates the completion cost by

$$h(T^+, A \to B\, C, i, k, j \mid G') = \sum_{(A', i', j') \in \mathrm{Fr}(T^+)} \log P(A', i', j' \mid G')$$

where $\mathrm{Fr}(T^+)$ is the set of frontier nodes of the
partial tree, and $G'$ is a simplified grammar admitting dynamic
programming. Here we choose the PCFG used as the base measure $H$
in the root of the PYP hierarchy. Accordingly the $\log P$ terms
can be computed cheaply using the PCFG inside probabilities.

• Local Frontier: which only takes into account the completion of
the following frontier nodes:

$$h(T^+, A \to B\, C, i, k, j \mid G') = \log P(B, i, k \mid G') + \log P(C, k + 1, j \mid G')$$

This heuristic focuses on the completion cost of the sub-span
using the selected rule.

The above heuristic functions are not admissible, hence the A*
algorithm is not guaranteed to find the optimal tree. However, the
PCFG provides reasonable estimates of the completion costs, and
accordingly with a sufficiently wide beam, search error is likely
to be low.
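Both heuristics rely on inside probabilities under the simplified grammar G'. A minimal sketch of how these could be computed and combined is given below; it assumes a CNF grammar stored as two dictionaries of rule probabilities and inclusive span indices, and the function names and toy grammar are hypothetical rather than the authors' implementation.

```python
import math
from collections import defaultdict

def inside_log_probs(words, lexical, binary):
    """Inside algorithm for a CNF PCFG G'.

    lexical[(A, word)] and binary[(A, B, C)] hold rule probabilities.
    Returns a dict mapping (A, i, j) to log P(A derives words[i..j]),
    with inclusive span indices.
    """
    n = len(words)
    chart = defaultdict(float)   # probabilities; converted to logs at the end
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                chart[(a, i, i)] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width - 1
            for (a, b, c), p in binary.items():
                for k in range(i, j):
                    left = chart.get((b, i, k), 0.0)
                    right = chart.get((c, k + 1, j), 0.0)
                    if left and right:
                        chart[(a, i, j)] += p * left * right
    return {key: math.log(v) for key, v in chart.items() if v > 0}

def local_frontier(chart, b, c, i, k, j):
    """h = log P(B, i, k | G') + log P(C, k+1, j | G')."""
    return chart.get((b, i, k), -math.inf) + chart.get((c, k + 1, j), -math.inf)

def full_frontier(chart, frontier):
    """h = sum of inside log-probabilities over frontier nodes (A', i', j')."""
    return sum(chart.get(node, -math.inf) for node in frontier)

toy_chart = inside_log_probs(
    ["time", "flies"],
    {("NN", "time"): 1.0, ("VB", "flies"): 1.0},
    {("S", "NN", "VB"): 0.5},
)
```

The chart depends only on the utterance and G', so it can be filled once per sentence and then queried for every hypothesis in the priority queue.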


5.2 MCMC Sampling 

We make use of the Metropolis-Hastings (MH) algorithm, which is a
Markov chain Monte Carlo (MCMC) method, for obtaining a sequence
of random trees. We then combine these trees to construct the
predicted tree.

In the MH algorithm, we use a PCFG as our proposal distribution
$Q$ and draw samples from it. Each sampled tree is then accepted
or rejected using the following acceptance rate:

$$\alpha(T, T') = \min\left\{1, \frac{P(T')\, Q(T)}{P(T)\, Q(T')}\right\}$$

where $T'$ is the sampled tree, $T$ is the current tree, $P(T')$
is the probability of the proposed tree under our model, and
$Q(T')$ is its probability under the proposal PCFG. Under some
conditions, i.e., detailed balance and ergodicity, it is
guaranteed that the stationary distribution of the underlying
Markov chain (defined by the MH sampling) is the distribution $P$
that our model induces over the space of trees. For each
utterance, we sample a fresh tree for the whole utterance from a
PCFG using the approach of Johnson et al. (2007), which works by
first computing the inside lattice under the proposal model (which
can be computed once and reused), followed by top-down sampling to
recover a tree. Finally the proposed tree is scored using the MH
test, according to which the tree is randomly accepted as the next
sample or else rejected, in which case the previous sample is
retained.
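The accept/reject step can be sketched as below. This is a toy illustration of an independence Metropolis-Hastings sampler, not the authors' implementation: `propose`, `log_p` and `log_q` are hypothetical callables standing in for sampling from the proposal PCFG and for scoring a tree under the model and under the proposal.

```python
import math
import random

def mh_accept_prob(log_p_cur, log_q_cur, log_p_new, log_q_new):
    """alpha(T, T') = min(1, P(T') Q(T) / (P(T) Q(T'))), in log space."""
    log_ratio = (log_p_new + log_q_cur) - (log_p_cur + log_q_new)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def mh_chain(propose, log_p, log_q, steps, rng):
    """Independence sampler: each proposal ignores the current state."""
    current = propose(rng)
    samples = []
    for _ in range(steps):
        candidate = propose(rng)
        alpha = mh_accept_prob(log_p(current), log_q(current),
                               log_p(candidate), log_q(candidate))
        if rng.random() < alpha:
            current = candidate     # accept the proposed state
        samples.append(current)     # on rejection the previous state is kept
    return samples

# Toy target over two states with a uniform proposal (hypothetical numbers).
toy_target = {"a": 0.8, "b": 0.2}
toy_samples = mh_chain(lambda rng: rng.choice(["a", "b"]),
                       lambda s: math.log(toy_target[s]),
                       lambda s: math.log(0.5),
                       20000, random.Random(0))
toy_freq_a = toy_samples.count("a") / len(toy_samples)
```

Only ratios of P enter the test, so the model score does not need to be normalised over the set of trees for the utterance.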


Once the sampling is finished, we need to choose a tree based on
statistics of the sampled collection of trees. One approach is to
select the most frequently sampled tree; however, this does not
work effectively in such large search spaces because of high
sampling variance. Note that local Gibbs samplers might be able to
address this problem, at least partly, through resampling subtrees
instead of full tree sampling (as done here). Local changes would
allow for more rapid mixing from trees with some high and low
scoring subtrees to trees with uniformly high scoring
sub-structures. We leave local sampling for future work, noting
that the obvious local operation of resampling complete sub-trees
or local tree fragments would compromise detailed balance, and
thus not constitute a valid MCMC sampler (Levenberg et al., 2012).


To address this problem, we use a Minimum Bayes Risk (MBR)
decoding method to predict the best tree (Goodman, 1996) as
follows: for each nonterminal-span pair, we record the count in
the collection of sampled trees. Then using the Viterbi algorithm,
we select the tree from the hypergraph for which the sum of the
induced nonterminal-span pairs is maximized. Roughly speaking,
this allows us to make local corrections that result in higher
accuracy compared to the best sampled trees.
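A simplified sketch of this decoding step follows. It assumes trees are nested tuples `(label, child, ...)` with words as strings, searches over all binary bracketings rather than the grammar-constrained hypergraph, and assigns one label per span; these simplifications and the function names are assumptions made to keep the example short.

```python
from collections import Counter
from functools import lru_cache

def labelled_spans(tree, start=0):
    """Return (spans, end); spans are (label, i, j) with inclusive indices."""
    label, children = tree[0], tree[1:]
    spans, pos = [], start
    for child in children:
        if isinstance(child, str):
            pos += 1                       # a word advances the position
        else:
            sub, pos = labelled_spans(child, pos)
            spans.extend(sub)
    spans.append((label, start, pos - 1))
    return spans, pos

def mbr_decode(sampled_trees, n_words):
    """Pick the bracketing whose labelled spans were sampled most often."""
    counts = Counter()
    for tree in sampled_trees:
        counts.update(labelled_spans(tree)[0])
    labels = sorted({label for label, _, _ in counts})

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best label for span (i, j), then the best split point below it.
        label = max(labels, key=lambda a: counts[(a, i, j)])
        score, chosen = counts[(label, i, j)], ((label, i, j),)
        if i < j:
            k = max(range(i, j), key=lambda s: best(i, s)[0] + best(s + 1, j)[0])
            left, right = best(i, k), best(k + 1, j)
            score += left[0] + right[0]
            chosen += left[1] + right[1]
        return score, chosen

    return best(0, n_words - 1)

t1 = ("S", ("NP", ("NN", "time")), ("VP", ("VB", "flies")))
t2 = ("S", ("VP", ("VB", "time")), ("NP", ("NN", "flies")))
mbr_score, mbr_spans = mbr_decode([t1, t1, t2], 2)
```

The selected spans need not coincide with any single sampled tree, which is what allows the local corrections mentioned above.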









Task      Train    Test    Len    Rules
parse     33180    2416    24     31920
pos EN    38219    5462    24     29499
pos DN     3638    1000    20      5269
pos SW    10653     389    18      9739

Table 1: Statistics for PTB syntactic parsing and part-of-speech
tagging, showing the number of training and test sentences,
average sentence length in words and number of grammar rules. For
morph the numbers are averaged over the 10 folds.


                            all             ≤ 40
Syntactic Parser         F1      ACC      F1      ACC
A* (Local Frontier)     75.33   16.12    76.21   16.85
A* (Full Frontier)      72.27   13.14    72.34   13.57
MCMC                    76.74   18.23    78.21   18.99
PCFG CYK                58.91    4.11    60.25    4.42

Table 2: Syntactic parsing results for the Penn treebank, showing
labelled F-Measure (F1) and exact bracketing match (ACC).


6 Experiments 

In order to evaluate the proposed model and prediction algorithms,
we performed two sets of experiments on tasks with different
structural complexity. The statistics of the tasks and datasets
are provided in Table 1.


6.1 Syntactic Parsing 

For syntactic parsing, we use the Penn treebank (PTB) dataset
(Marcus et al., 1993). We used the standard data splits for
training and testing (train sec 2-21; validation sec 22; test sec
23). We followed the preprocessing steps of Petrov et al. (2006)
by right-binarizing the trees and replacing words with count ≤ 1
in the training sample with generic unknown word markers
representing the tokens' lexical features and position. The
results reported in Table 2 are produced by EVALB.
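Right-binarization can be sketched as follows, assuming trees are nested tuples `(label, child, ...)` with words as strings. The intermediate symbol `@label` is a hypothetical naming convention that keeps no record of already generated siblings; the actual labels used in the preprocessing are not specified here.

```python
def _fold(label, children):
    """Group all but the first child under an intermediate right-branching node."""
    if len(children) <= 2:
        return (label, *children)
    # "@label" is an assumed name for the intermediate non-terminal.
    inter = label if label.startswith("@") else "@" + label
    return (label, children[0], _fold(inter, children[1:]))

def right_binarize(tree):
    """Right-binarize an n-ary tree given as nested tuples."""
    if isinstance(tree, str):
        return tree
    children = [right_binarize(child) for child in tree[1:]]
    return _fold(tree[0], children)

flat_np = ("NP", ("DT", "the"), ("JJ", "old"), ("NN", "man"))
binary_np = right_binarize(flat_np)
```

Nodes with one or two children are left unchanged, so unary and already binary rules pass through as they are.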

The results in Table 2 demonstrate the superiority of our model
compared to the baseline PCFG. We note that the A* parser becomes
less effective (even with a large beam size) for this task, which
we attribute to the large search space arising from the large
grammar and long sentences. Our best results are achieved by
MCMC, demonstrating the effectiveness of MCMC in large search
spaces.

An interesting observation is how our results compare with those
achieved by bounded vertical and horizontal markovization reported
in Klein and Manning (2003). Our binarization corresponds to one
of their simpler settings for horizontal markovization, namely
h = 0 in their terminology, and we also ignore the head
information which is used in their models. Despite this we still
manage to equal their results obtained using vertical context of
size 3 (v = 3), with 76.7 F1 score. Their best result,
F1 = 79.74, was achieved with h ≤ 2, v = 3 (and tags for head
words). We believe that our model would outperform theirs if we
consider greater horizontal markovization and incorporate head
word information. To facilitate a fair comparison with vertical
markovization, we experimented with limiting the size of the
vertical contexts to 2, 3 or 4 within our model. Using MCMC
parsing we found that performance consistently improved as the
size of the context was increased, scoring 68.1, 71.1 and 75.0
F-measure respectively. This is below the 76.7 F-measure of our
unbounded-context model, which adapts itself to data to
effectively capture the right context.

Overall our approach significantly outperforms the baseline PCFG,
although note these results are well below the current
state-of-the-art in parsing, which typically makes use of
discriminative training with much richer features. We speculate
that future enhancements could close the gap between our results
and those of modern parsers, while offering the potential benefits
of our generative model, which allows further incorporation of
different types of contexts (e.g., head words and n-gram lexical
context).


6.2 Part-of-Speech Tagging 

The part of speech (POS) corpora have been extracted from PTB
(sections 0-18 for training and 22-24 for test) for English, and
from the NAACL-HLT 2012 Shared Task on Grammar Induction² for
Danish and Swedish (Gelling et al., 2012). We convert the sequence
of part-of-speech tags for each sentence into a tree structure
analogous to a Hidden Markov Model (HMM). For each POS tag we
introduce a twin (e.g., ADJ' for ADJ) in order to encode HMM-like
transition and emission probabilities in the grammar. As shown in
Figure 5, this representation guarantees that all the rules in the
structures are either in the form of $t_i \to t_j\, t'_i$
(transition) or $t'_i \to \text{word}$ (emission).

2 http://wiki.cs.ox.ac.uk/InducingLinguisticStructure/SharedTask

Figure 5: The analogy between HMM (i) and our representation (ii)
for the part-of-speech tags (PRON, VERB, ADV) of the sentence
"that's fine now."

                         English         Danish          Swedish
POS Tagger              TL      SL      TL      SL      TL      SL
A* (Local Frontier)    95.50   54.11   89.85   35.10   87.04   32.13
A* (Full Frontier)     95.27   53.88   88.57   32.6    85.62   28.53
MCMC                   96.04   54.25   95.55   72.93   89.97   34.45
PCFG CYK               94.69   47.22   89.04   31.7    89.76   33.93
Stanford Tagger        97.24   56.34   93.66   51.30   91.28   37.02

Table 3: TL stands for Token-Level Accuracy, SL stands for
Sentence-Level Accuracy. MCMC results are the average of 10 runs.

The tagging results are reported in Table 3, including comparison
with the baseline PCFG (= HMM) and the state-of-the-art Stanford
POS Tagger (Toutanova and Manning, 2000), which we trained and
tested on these datasets. As illustrated in Table 3, our model
consistently improves over the PCFG baseline. While for Danish we
outperform the state-of-the-art tagger, for English and Swedish we
are a little behind the Stanford Tagger. This is a promising
result since our model is only based on the rules and their
contexts, as opposed to the Stanford Tagger which uses complex
hand-designed features and a complex form of discriminative
training.

Note the strong performance of MCMC sampling, which consistently
outperforms A* search on the three tagging tasks.


7 Conclusion and Future Work

We have proposed a novel hierarchical model over linguistic trees
which exploits global context by conditioning the generation of a
rule in a tree on an unbounded tree context consisting of the
vertical chain of its ancestors. To facilitate learning of such a
large and unbounded model, the predictive distributions associated
with tree contexts are smoothed in a recursive manner using a
hierarchical Pitman-Yor process. We have shown how to perform
prediction based on our model to predict the parse tree of a given
utterance using various search algorithms, e.g. A* and Markov
Chain Monte Carlo.

The model consistently improved over baseline methods in two
tasks, and produced state-of-the-art results for Danish
part-of-speech tagging.

In future work, we would like to consider sampling the seating
arrangements and model hyperparameters, and seek to incorporate
several different notions of context besides the chain of
ancestors.